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Abstract 

Neural network models of early sensory processing typically reduce the dimension¬ 
ality of streaming input data. Such networks learn the principal subspace, in the sense 
of principal component analysis (PCA), by adjusting synaptic weights according to 
activity-dependent learning rules. When derived from a principled cost function these 
rules are nonlocal and hence biologically implausible. At the same time, biologically 
plausible local rules have been postulated rather than derived from a principled cost 
function. Here, to bridge this gap, we derive a biologically plausible network for sub¬ 
space learning on streaming data by minimizing a principled cost function. In a depar¬ 
ture from previous work, where cost was quantified by the representation, or reconstruc¬ 
tion, error, we adopt a multidimensional scaling (MDS) cost function for streaming data. 
The resulting algorithm relies only on biologically plausible Hebbian and anti-Hebbian 
local learning rules. In a stochastic setting, synaptic weights converge to a stationary 



state which projects the input data onto the principal subspace. If the data are generated 
by a nonstationary distribution, the network can track the principal subspace. Thus, our 
result makes a step towards an algorithmic theory of neural computation. 


1 Introduction 


Early sensory processing reduces the dimensionality of streamed inputs (Hyvarinen 


et al. 2009) as evidenced by a high ratio of input to output nerve fiber counts (Shep 


herd, 2003). For example, in the human retina, information gathered by ^125 million 


photoreceptors is conveyed to the Lateral Geniculate Nucleus through £=T million gan¬ 


glion cells (Hubei, 1995). By learning a lower-dimensional subspace and projecting the 


streamed data onto that subspace the nervous system de-noises and compresses the data 
simplifying further processing. Therefore, a biologically plausible implementation of 
dimensionality reduction may offer a model of early sensory processing. 

For a single neuron, a biologically plausible implementation of dimensionality re¬ 
duction in the streaming, or online, setting has been proposed in the seminal work of 
( [Oj'aj |T982| ), Figure |1}\. At each time point, t, an input vector, x t , is presented to the 
neuron, and, in response, it computes a scalar output, y t = wx t , were w is a row-vector 
of input synaptic weights. Furthermore, synaptic weights w are updated according to a 
version of Hebbian learning called Oja’s rule: 


w i- w + 772/t(x f - w y t ), 


( 1 ) 


where q is a learning rate and T designates a transpose. Then, the neuron’s synap¬ 
tic weight vector converges to the principal eigenvector of the covariance matrix of 
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the streamed data ( ]Ojaj , [1982). Importantly, Oja’s learning rule is local meaning that 
synaptic weight updates depend on the activities of only pre- and postsynaptic neurons 
accessible to each synapse and, therefore, biologically plausible. 

Oja’s rule can be derived by an approximate gradient descent of the mean squared 


representation error (Cichocki and Amari 2002 Yang 1995), a so-called synthesis 


view of principal component analysis (PCA) (Pearson 1901} Preisendorfer and Mobley, 


1988): 


mm 

W 


; n £ 


x t — w wx ( 


( 2 ) 


Computing principal components beyond the first requires more than one output 
neuron and motivated numerous neural networks. Some well-known examples are the 


Generalized Hebbian Algorithm (GHA) (Sanger, 1989), Foldiak’s network (Foldiak 


1989), the subspace network (Karhunen and Oja, 1982), Rubner’s network (Rubner and 


Tavan 1989, Rubner and Schulten 1990), Leen’s minimal coupling and full coupling 
networks (Leen, 1990 1991) and the APEX network (Kung and Diamantaras 1990 


Rung et al. 1994). We refer to (Becker and Plumbley, 1996 Diamantaras and Kung, 


1996; Diamantaras, 2002) for a detailed review of these and further developments. 


However, none of the previous contributions was able to derive a multineuronal 
single-layer network with local learning rules by minimizing a principled cost function, 
in a way that Oja’s rule ([1]) was derived for a single neuron. The GHA and the sub¬ 
space rules rely on nonlocal learning rules: feedforward synaptic updates depend on 
other neurons’ synaptic weights and activities. Leen’s minimal network is also nonlo¬ 
cal: feedforward synaptic updates of a neuron depend on its lateral synaptic weights. 


While Foldiak’s, Rubner’s and Leen’s full coupling networks use local Hebbian and 
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anti-Hebbian rules, they were postulated rather than derived from a principled cost 
function. APEX network, perhaps, comes closest to our criterion: the rule for each 
neuron can be related separately to a cost function which includes contributions from 
other neurons. But no cost function describes all the neurons combined. 

At the same time, numerous dimensionality reduction algorithms have been de¬ 
veloped for data analysis needs disregarding the biological plausibility requirement. 
Perhaps the most common approach is again PCA, which was originally developed 


for batch processing (Pearson 1901) but later adapted to streaming data (Yang 1995 


Crammer 2006, Arora et al. 2012; Goes et al. 2014). For a more detailed collec¬ 


tion of references, see e.g. fBalzano 2012). These algorithms typically minimize the 


representation error cost function: 


min X-F'FX 


(3) 


where X is a data matrix and F is a wide matrix (for detailed notation, see below). 
The minimum of ([3]) is when rows of F are orthonormal and span the m-dimensional 


principal subspace, and therefore F F is the projection matrix to the subspace (Yang 


1995 


A gradient descent minimization of such cost function can be approximately imple¬ 


mented by the subspace network (Yang |1995[ ), which, as pointed out above, requires 
nonlocal learning rules. While this algorithm can be implemented in a neural network 
using local learning rules, it requires a second layer of neurons (Oja 1992[), making it 


1 


Recall that, in general, the projection matrix to the row space of a matrix P is given 

-l 


by P T (PP T ) P, provided PP T is full rank (Plumbley 
orthonormal this reduces to P P. 


1995). If the rows of P are 
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less appealing. 

In this paper, we derive a single-layer network with local Hebbian and anti-Hebbian 
learning rules, similar in architecture to Foldiak’s (Foldiak 1989) (see Figure |TJ3 ), from 
a principled cost function and demonstrate that it recovers a principal subspace from 
streaming data. The novelty of our approach is that, rather than starting with the rep¬ 
resentation error cost function traditionally used for dimensionality reduction, such as 
PCA, we use the cost function of classical multidimensional scaling (CMDS), a member 


of the family of multidimensional scaling (MDS) methods (Cox and Cox 2000, Mar- 


dia et al. 1980). Whereas the connection between CMDS and PCA has been pointed 


out previously (Williams 2001 Cox and Cox 2000 Mardia et al. 1980), CMDS is 


typically performed in the batch setting. Instead, we developed a neural network imple¬ 
mentation of CMDS for streaming data. 

The rest of the paper is organized as follows. In Section 2, by minimizing the 
CMDS cost function we derive two online algorithms implementable by a single-layer 
network, with synchronous and asynchronous synaptic weight updates. In Section 3, 
we demonstrate analytically that synaptic weights define a principal subspace whose 
dimension m is determined by the number of output neurons and that the stability of 
the solution requires that this subspace corresponds to top m principal components. In 
Section 4, we show numerically that our algorithm recovers the principal subspace of 
a synthetic dataset, and does it faster than the existing algorithms. Finally, in Section 
5, we consider the case when data are generated by a nonstationary distribution and 
present a generalization of our algorithm to principal subspace tracking. 
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Figure 1: An Oja neuron and our neural network. A. A single Oja neuron computes 
the principal component, y, of the input data, x, if its synaptic weights follow Hebbian 
updates. B. A multineuron network computes the principal subspace of the input if 
the feedforward connection weight updates follow a Hebbian and the lateral connection 
weight updates follow an anti-Hebbian rule. 

2 Derivation of online algorithms from the CMDS cost 
function 


CMDS represents high-dimensional input data in a lower-dimensional output space 
while preserving pairwise similarities between samples^ Young and Householder 


1938 


Torgerson, 1952). Let T centered input data samples in R n be represented by column- 
vectors x.t=i,...,T concatenated into an n x T matrix X = [xi,..., x^]. The correspond¬ 
ing output representations in M m , m < n, are column-vectors, yt=i,...,T, concatenated 
into an m x T dimensional matrix Y = [yi,... ,yr]- Similarities between vectors in 
Euclidean spaces are captured by their inner products. For the input (output) data, such 


Whereas MDS in general starts with dissimilarities between samples that may not 
live in Euclidean geometry, in CMDS data are assumed to have a Euclidean representa¬ 


tion. 
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inner products are assembled into a T x T Gram matrixj^jx X (Y T Y). For a given X, 
CMDS finds Y by minimizing the so-called “strain” cost function (Carroll and Chang, 


1972): 


min ||X T X - Y'Y 


T -> 


(4) 


For discovering a low-dimensional subspace, the CMDS cost function ([4]) is a viable 
alternative to the representation error cost function (|3]) because its solution is related to 


PC A ( |Williams| |2001| |Cox and Cox[ |2000] |Mardia et al.[ |1980| ). Specifically, Y is the 
linear projection of X onto the (principal sub-)space spanned by m principal eigenvec¬ 
tors of the sample covariance matrix C T = j J2t =i x / x / = XX T . The CMDS cost 
function defines a subspace rather than individual eigenvectors because left orthogonal 
rotations of an optimal Y stay in the subspace and are also optimal, as is evident from 
the symmetry of the cost function. 

In order to reduce the dimensionality of streaming data, we minimize the CMDS 
cost function t[4]) in the stochastic online setting. At time T, a data sample, x T , drawn 
independently from a zero-mean distribution is presented to the algorithm which com¬ 
putes a corresponding output, y T prior to the presentation of the next data sample. 
Whereas in the batch setting, each data sample affects all outputs, in the online setting, 
past outputs cannot be altered. Thus, at time T the algorithm minimizes the cost de¬ 
pending on all inputs and ouputs up to time T with respect to y T while keeping all the 


’When input data are pairwise Euclidean distances, assembled into a matrix Q, the 


Gram matrix, X T X, can be constructed from Q by HZH, where Z tJ = —1/2Q 


* 7 ’ 


H = I n — 1 /nll T is the centering matrix, and I„ is the n dimensional identity matrix 


(Cox and Cox, 2000i Mardia et al. 1980). 
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previous outputs fixed: 


T T 


yx = argmin ||X T X — Y t Y||”, = argmin XX ( xjxt'-yjyt) , (5) 


Yt 


y t 


t =l t '=l 


where the last equality follows from the definition of the Frobenius norm. By keeping 
only the terms that depend on current output y-p we get: 

4x ? y ^ + 2y ? yr - 2||x T || 2 ||y T || 2 + ||y T | 

( 6 ) 


yx = arg mm 

Yt 


. t=i 


, i=l 


In the large-T limit, expression ([6]) simplifies further because the first two terms 
grow linearly with T, and therefore dominate over the last two. After dropping the last 
two terms we arrive at: 


yx = arg min 
yr 




yx + 2yJ 


V ^ ' 

2^ yty t J yr 


(7) 


We term the cost in expression ([7]) the “online CMDS cost”. Because the online 
CMDS cost is a positive semi-definite quadratic form in y T for sufficiently large T, 
this optimization problem is convex. While it admits a closed-form analytical solution 
via matrix inversion, we are interested in biologically plausible algorithms. Next, we 
consider two algorithms that can be mapped onto single-layer neural networks with 
local learning rules: coordinate descent leading to asynchronous updates and Jacobi 
iteration leading to synchronous updates. 


2.1 A neural network with asynchronous updates 

The online CMDS cost function ([7]) can be minimized by coordinate descent which at 
every step finds the optimal value of one component of yx while keeping the rest fixed. 
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The components can be cycled through in any order until the iteration converges to 


a fixed point. Such iteration is guaranteed to converge under very mild assumptions: 

T-1 

diagonals of y t yj has to be positive (Luo and Tseng, 1991), meaning that each 

t= 1 


output coordinate has produced at least one non-zero output before current time step T. 
This condition is almost always satisfied in practice. 

The cost to be minimized at each coordinate descent step with respect to z lh chan¬ 
nel’s activity is: 


Vt.% = arg mm 

UT,i 


<T -1 


'T -1 


■ 4x ? Xty < T yr + 2 yJ 


ytyj I yr 


. t=i 


,t=i 


Keeping only those terms that depend on y Ti yields: 


UT,i = arg mm 

VT,i 


( T -1 

^ Xt,kVt,i ) VT,i 
t =i 

+ 4 y yr,j (y yi,jVt A vr,i + 2 (y vtA yh 

^ \t= i / \t=i / 

By taking a derivative with respect to y T , and setting it to zero we arrive at the following 


closed-form solution: 


/T—l 


VT,i ~ 


E E y t ,, 

fc \t=i 


i X t,k I X T,fc 


/T—l 

E E y t .ah , 3 ) vt )3 


i+i Vi=l 


T—l 

E y t 2 ,i 


T—l 


( 8 ) 


E y?, 


To implement this algorithm in a neural network we denote normalized input-output 


and output-output covariances, 

T—l 

E y t ,i x t,k 
w T , ik = ^-, 

E yh 


T-l 

E VtM,- 


M — t_1 


M tm ~ 0) 


(9) 


E yh 


t =i t=i 

allowing us to rewrite the solution ([8]) in a form suggestive of a linear neural network: 


yr,i y W T ,ijX T ,j - y M Tjij y T j, 

3 =1 j=l 


( 10 ) 


9 













where W 7 - and M 7 represent the synaptic weights of feedforward and lateral connec¬ 
tions respectively, Figure IB. 

Finally, to formulate a fully online algorithm we rewrite ([9]) in a recursive form. 
This requires introducing a scalar variable D Ti representing cumulative activity of a 
neuron i up to time T — 1, 

T—l 

D T , = Y.vi,i (H) 

t =i 

Then, at each time point, T, after the output y T is computed by the network, the fol¬ 
lowing updates are performed: 


D T+ i,i D T ,i + VT,i 

Wt+ 1 ,ij WT,ij + UT,i { x T,j — WT,ijUT,i ) /^T+l,j 

Y- M Tji j + Ut.i (y T ,j — MT,ijVT,i ) /Ar+i,»- (12) 


Equations ( [TO] ) and ( [12] ) define a neural network algorithm that minimizes the on¬ 
line CMDS cost function ([7]) for streaming data by alternating between two phases: 
neural activity dynamics and synaptic updates. After a data sample is presented at time 
T, in the neuronal activity phase, neuron activities are updated one-by-one, i.e. asyn¬ 
chronously, ([TO]) until the dynamics converges to a fixed point defined by the following 
equation: 


y x — WyXy — 1Y1 tYt 


Yt — (I m + Mr) 1 WtXt, (13) 


where I m is the m-dimensional identity matrix. 

In the second phase of the algorithm, synaptic weights are updated, according to 
a local Hebbian rule ( fl2| ) for feedforward connections, and according to a local anti- 
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Hebbian rule (due to the (—) sign in equation ( [TO] )) for lateral connections. Interest¬ 
ingly, these updates have the same form as the single-neuron Oja’s rule ([]} ( |Oja[|1982j ), 
except that the learning rate is not a free parameter but is determined by the cumulative 
neuronal activity 1/D T+1 ^ To the best of our knowledge such single-neuron rule (Hu 


et al. 2013) has not been derived in the multineuron case. An alternative derivation of 


this algorithm is presented in Appendix | A. 1| 

Unlike the representation error cost function ([3]), the CMDS cost function ([4]) is 
formulated only in terms of input and output activity. Yet, the minimization with respect 
to Y recovers feedforward and lateral synaptic weights. 


2.2 A neural network with synchronous updates 

Here, we present an alternative way to derive a neural network algorithm from the large¬ 
ly limit of the online CMDS cost function (|7|) . By taking a derivative with respect to 
yr and setting it to zero we arrive at the following linear matrix equation: 


'T-l 


/ T— 1 


^ytyj yr = 5> x 7 x t, 


(14) 


j .=i 


. t=i 


We solve this system of equations using Jacobi iteration (Strang, 2009), by first split 


ting the output covariance matrix that appears on the left side of ( [14] ) into its diagonal 


Mhe single neuron Oja’s rule derived from the minimization of a least squares opti¬ 


mization cost function ends up with the identical learning rate (Diamantaras l 2002; Hu 


et al. 2013). Motivated by this fact, such learning rate has been argued to be optimal 


for the APEX network (Diamantaras and Kung[ 1996 Diamantaras, 2002) and used by 


others (Yang. 1995). 
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component Dy and the remainder R 7 : 


' T -1 


^2y t yJ = D t + R t , 


. t=i 


where the i th diagonal element of D 7 , D T ,i = Eii 1 y't.p as defined in ( [TT| ). Then, ( fT4[ ) 
is equivalent to: 


( T -1 


y T = D t x ^ y t xJ x T - D T 1 Rry 7 -. 


. t=i 


Interestingly, the matrices obtained on the right side are algebraically equivalent to 


the feedforward and lateral synaptic weight matrices defined in (|9|): 


( T -1 

W t = D 7 ‘ (y]y,x t T 


and Mr = D^R- 


T 


(15) 


,t=i 


Hence, the Jacobi iteration for solving ( [T4| ) 


yr 4 — Wj’X'jp — Mryr- 


(16) 


converges to the same fixed point as the coordinate descent, ( fT3j ). 

Iteration (JT6]) is naturally implemented by the same single-layer linear neural net¬ 
work as for the asynchronous update, Figure IB. For each stimulus presentation the 
network goes through two phases. In the first phase, iteration ( |T6| ) is repeated until con¬ 
vergence. Unlike the coordinate descent algorithm which updated activity of neurons 
one after another, here, activities of all neurons are updated synchronously. In the sec¬ 
ond phase, synaptic weight matrices are updated according to the same rules as in the 
asynchronous update algorithm ([12]). 

Unlike the asynchronous update ([TO]), for which convergence is almost always guar¬ 
anteed (Luo and Tseng| 19911, convergence of iteration ( |i~6| ) is guaranteed only when 
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the spectral radius of M is less than 1 (Strang 


2009). Whereas we cannot prove that this 


condition is always met, in practice, the synchronous algorithm works well. While in 
the rest of the paper, we consider only the asynchronous updates algorithm, our results 
hold for the synchronous updates algorithm provided it converges. 


3 Stationary synaptic weights define a principal sub¬ 


space 


What is the nature of the lower dimensional representation found by our algorithm? 
In CMDS, outputs y Tji are the Euclidean coordinates in the principal subspace of the 


input vector (Cox and Cox 2000 Mardia et al. 1980). While our algorithm uses the 
same cost function as CMDS, the minimization is performed in the streaming, or online, 
setting. Therefore, we cannot take for granted that our algorithm will find the principal 
subspace of the input. In this section, we provide analytical evidence, by a stability 
analysis in a stochastic setting, that our algorithm extracts the principal subspace of 
the input data and projects onto that subspace. We start by previewing our results and 
method. 


Our algorithm performs a linear dimensinality reduction since the transformation 
between the input and the output is linear. This can be seen from the neural activity 
fixed point ( fT3| ), which we rewrite as 


y T — F r x r , 


(17) 
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where Fy is a matrix defined in terms of the synaptic weight matrices W t and M r : 


F T (I m + M t ) -1 W- 


T- 


(18) 


Relation ( [T7] ) shows that the linear filter of a neuron, which we term a “neural filter”, is 
the corresponding row of Ft. The space that neural filters span, the rowspace of Fr, is 
termed a “filter space”. 

First, we prove that in the stationary state of our algorithm, neural filters are in¬ 
deed orthonormal vectors (section |T2| Theorem Q}. Second, we demonstrate that the 
orthonormal filters form a basis of a space spanned by some m eigenvectors of the co- 
variance of the inputs C (section [373] , Theorem [2]). Third, by analyzing linear perturba¬ 
tions around the stationary state, we find that stability requires these m eigenvectors to 
be the principal eigenvectors and, therefore, the filter space to be the principal subspace 
(section [3~4l Theorem [3]). 

These results show that even though our algorithm was derived starting from the 
CMDS cost function ([4]), F T converges to the optimal solution of the representation 
error cost function ([3]). This correspondence suggests that FJFt is the algorithm’s 
current estimate of the projection matrix to the principal subspace. Further, in (|3]), 
columns of F T are interpreted as data features. Then, columns of Fj, or neural filters, 
are the algorithm’s estimate of such features. 

Rigorous stability analyses of PCA neural networks (Oja 1982[ Oja and Karhunen 


1985; Sanger 1989; Oja 1992, Homik and Kuan 1992 Plumbley 1995) typically use 


the ODE method (Kushner and Clark 1978): Using a theorem of stochastic approxi¬ 


mation theory (Kushner and Clark 1978), the convergence properties of the algorithm 
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are determined using a corresponding deterministic differential equatiorj^] 

Unfortunately the ODE method cannot be used for our network. While the method 
requires learning rates that depend only on time, in our network learning rates (1 / Dx+i,i) 
are activity dependent. Therefore we take a different approach. We directly work with 
the discrete-time system, assume convergence to a “stationary state”, to be defined be¬ 
low, and study the stability of the stationary state. 


3.1 Preliminaries 

We adopt a stochastic setting where the input to the network at each time point, x t , is 
an //-dimensional i.i.d. random vector with zero mean, (x,) = 0, where brackets denote 
an average over the input distribution, and covariance C = (x t x7). 

Our analysis is performed for the “stationary state” of synaptic weight updates, i.e. 
when averaged over the distribution of input values, the updates on W and M average 
to zero. This is the point of convergence of our algorithm. For the rest of the section, 
we drop the time index T to denote stationary state variables. 

The remaining dynamical variables, learning rates I/Dt+ 1 ,%, keep decreasing at 
each time step due to neural activity. We assume that the algorithm has run for a suffi¬ 
ciently long time such that the change in learning rate is small and it can be treated as 
a constant for a single update. Moreover, we assume that the algorithm converges to a 
stationary point sufficiently fast such that the following approximation is valid at large 


’Application of stochastic approximation theory to PC A neural networks depends 


on a set of mathematical assumptions. See (Zufiria, 2002) for a critique of the validity 


of these assumptions and an alternative approach to stability analysis. 
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T: 


1 _ 1 1 

where y is calculated with stationary state weight matrices. 
We collect these assumptions into a definition. 

Definition 1 (Stationary State). In the stationary state, 

(A Wn) = (AMjj) = 0, 


A T ( y ,?) ’ 

with T large. 

The stationary state assumption leads us to define various relations between synaptic 
weight matrices, summarized in the following corollary: 

Corollary 1. In the stationary state, 

(yiXj) = (y 2 ) Wi j} (19) 

and 

(: ViVj) = (Vi) (■ M ij + Sij), ( 20 ) 

where Sy is the Kronecker-delta. 

Proof. Stationarity assumption when applied to the update rule on W ( |T2| ) leads imme¬ 
diately to ( [T9| ). Stationarity assumption applied to the update rule on M ( [l2| ) gives: 

(yiVj) = (hi) M v, 
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The last equality does not hold for i — j since diagonal elements of M are zero. To 


cover the case i = j, we add an identity matrix to M, and hence one recovers (20). □ 


Remark. Note that ( [20] ) implies (yf) M tJ = fyf) M ]% , i.e. that lateral connection 
weights are not symmetrical. 


3.2 Orthonormality of neural filters 

Here we prove the orthonormality of neural filters in the stationary state. First, we need 
the following lemma: 

Lemma 1. In the stationary state, the following equality holds: 

I m + M = WF t . (21) 

Proof By ( |20| ), (yf) (M lk + S ik ) = (yyy k ). Using y = Fx, we substitute for y k on the 
right hand side: (yf) (M rk + S ik ) = F kj (y,Xj). Next, the stationary condition ( fl9| ) 
yields: (yf) (M lk + S ik ) = (yf) ff , FkjWij. Canceling (yf) on both sides proves the 
Lemma. □ 


Now we can prove our theorem. 


Theorem 1. In the stationary state, neural filters are orthonormal: 


FF 1 = I m . 


( 22 ) 


Proof First, we substitute for F (but not for F T ) its definition ( [T8| ): FF T = (I m + M) 1 WF 
Next, using Lemma[lj we substitute WF T by (I m + M). The right hand side becomes 


I??i + M) (I m + M) — I. m . 


□ 


Remark. Theorem[I]implies that rank(F) 


= m. 
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3.3 Neural filters and their relationship to the eigenspace of the co- 
variance matrix 

How is the filter space related to the input? We partially answer this question in Theo¬ 
rem [2j using the following lemma: 

Lemma 2. In the stationary state, F T F and C commute: 

F t FC = CF t F. (23) 

Proof. See Appendix |A.2[ □ 

Now we can state our second theorem. 


Theorem 2. At the stationary state state, the filter space is an m-dimensional subspace 
in W that is spanned by some m eigenvectors of the covariance matrix. 


Proof. Because F T F and C commute (Lemma [ 2 ]), they must share the same eigenvec¬ 
tors. Equation ( |22| ) of Theorem[I]implies that m eigenvalues of F T F are unity and the 
rest are zero. Eigenvectors associated with unit eigenvalues span the rowspace of 
and are identical to some m eigenvectors of C. □ 


Which m eigenvectors of C span the filter space? To show that these are the eigen¬ 
vectors corresponding to the largest eigenvalues of C, we perform a linear stability 
analysis around the stationary point and show that any other combination would be 
unstable. 


^If this fact is not familiar to the reader, we recommend Strang’s (Strang 2009) 


discussion of Singular Value Decomposition. 


18 














3.4 Linear stability requires neural filters to span a principal sub¬ 


space 

The strategy here is to perturb F from its equilibrium value and show that the perturba¬ 
tion is linearly stable only if the row space of F is the space spanned by the eigenvectors 
corresponding to the m highest eigenvalues of C. To prove this result, we will need two 
more lemmas. 

Lemma 3. Let H be an rn x n real matrix with orthonormal rows and G is an (n — 
m) x n real matrix with orthonormal rows, whose rows are chosen to be orthogonal to 
the rows of H. Any n x rn real matrix Q can be decomposed as: 

Q = AH + SH + BG, 

where A is an m x rn skew-symmetric matrix, S is an m x m symmetric matrix and B 
is an m x (n — m) matrix. 

Proof. Define B := Q G T , A := § (Q H T - H Q T ) and S := \ (Q H T + H Q T ). 
Then, AH + SH + BG = Q (H T H + G T G) = Q. □ 

Let’s denote an arbitrary perturbation of F as (IF, where a small parameter is im¬ 
plied. We can use Lemma [3] to decompose i)F as 

+F = 5AF + 5SF + (iBG, (24) 

where the rows of G are orthogonal to the rows of F. Skew-symmetric 5 A corresponds 
to rotations of filters within the filter space, i.e. it keeps neural filters orthonormal. 
Symmetric ciS keeps the filter space invariant but destroys orthonormality. 5B is a 
perturbation that takes the neural filters outside of the filter space. 
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Next, we calculate how (IF evolves under the learning rule, i.e. (A<5F). 


Lemma 4 . A perturbation to the stationary state has the following evolution under the 
learning rule to linear order in perturbation and linear order in T 1 ; 


/A ST? \ 1 (Im + M) 


-1 

ik 


^ ^ SFkiCij ^ ^ 5FkiF rp Ci p F r j 


Ipr 


E Fk,SF rp Ci p F, 

Ipr 


rj 


Proof. Proof is provided in Appendix |A.3| 


(25) 


□ 


Now, we can state our main result in the following theorem: 


Theorem 3. The stationary state of neuronal filters F is stable, in large-T limit, only if 
the m dimensional filter space is spanned by the eigenvectors of the covariance matrix 
corresponding to the m highest eigenvectors. 


Proof Sketch. Full proof is given in Appendix |A.4[ Here we sketch the proof. 

To simplify our analysis, we choose a specific G in Lemma [3] without losing gener¬ 
ality. Let v 1 ’ - "’” be eigenvectors of C and v 1, ’" ,n be corresponding eigenvalues, labeled 
so that the first m eigenvectors span the row space of F (or filter space). We choose 


rows of G to be the remaining eigenvectors, i.e. G' := V" +1 


By extracting the evolution of components of (IF from (25) using (24), we are ready 


to state the conditions under which perturbations of F are stable. Mutlipying ( [25] ) on 
the right by G T gives the evolution of <5B: 


(A />Bj)=Y, P !M ^ere P’ k = 


1 ( (I m + M) 


-1 


T 


(vl) 


i^ v j+m _ X 


ik 
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Here we changed our notation to SB k j = 5B J k to make it explicit that for each j we 
have one matrix equation. These equations are stable when all eigenvalues of all P y are 
negative, which requires as shown in the Appendix |A.4| 

> {v m+1 ,...,v n } . 


This result proves that the perturbation is stable only if the filter space is identical to the 
space spanned by eigenvectors corresponding to the m highest eigenvalues of C. 


It remains to analyze the stability of 5 A and 5 S perturbations. Multiplying (25) on 


the right by F T gives, 


(A 5Aij) = 0 and (A 5Sij) = —SSij. 


6A perturbation, which rotates neural filters, does not decay. This behavior is inherently 
related to the discussed symmetry of the strain cost function (|4]) with respect to left 
rotations of the Y matrix. Rotated y vectors are obtained from the input by rotated 
neural filters and hence 5A perturbation does not affect the cost. On the other hand, 
AS destroys orthonormality and these perturbations do decay, making the orthonormal 
solution stable. □ 


To summarize our analysis, if the dynamics converges to a stationary state, neural 
filters form an orthonormal basis of the principal subspace. 


4 Numerical simulations of the asynchronous network 


Here, we simulate the performance of the network with asynchronous updates, G9 


and (|12j), on synthetic data. The data were generated by a colored Gaussian process 
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with an arbitrarily chosen “actual” covariance matrix. We choose the number of input 
channels, n = 64, and the number of output channels, m — 4. In the input data, 
the ratio of the power in first 4 principal components to the power in remaining 60 
components was 0.54. W and M were initialized randomly, the step size of synaptic 
updates were initialized to l/-Do,i = 0.1. Coordinate descent step is cycled over neurons 
until magnitude of change in yr in one cycle is less than 10 -5 times the magnitude of 


yr- 


We compared the performance of the asynchronous updates network, (JTO]) and ( fl2] ), 
with two previously proposed networks, APEX (jKung and Diamantaras, 1990 ;Kung 


et al.| 1994) and Foldiak’s ( jFoldiak[[l989[ ), on the same dataset, Figure [2] APEX net¬ 


work uses the same Hebbian/anti-Hebbian learning rules for synaptic weights, but the 
architecture is slightly different in that the lateral connection matrix, M, is lower tri¬ 
angular. Foldiak’s network has the same architecture as ours, Figure [lj3, and the same 
learning rules for feedforward connections. However, the learning rule for lateral con¬ 
nections is A Mij oc y t y :n unlike 0 For the sake of fairness, we applied the same 
adaptive step size procedure for all networks. As in < [I2| ), the stepsize for each neuron 
i at time T was l/D T +i,i, with D T += D T y + y^i- In fact, such learning rate has 


been recommended and argued to be optimal for the APEX network (Diamantaras and 
Kung, |T996| Diamantaras] |2002[ ), see also footnote [4} 


To quantify the performance of these algorithms, we used three different metrics. 
First is the strain cost function, ([4]). normalized by T 2 , Figure [2]^. Such normalization 
is chosen because the minimum value of offline strain cost is equal to the power con¬ 
tained in the eigenmodes beyond the top m: T 2 J2k=m +i { l,k ) 2 ’ where {V,..., v n } are 
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eigenvalues of sample covariance matrix C T ( |Cox and Cox[ |2000; M ardia et al4|1980| ). 
For each of the three networks, as expected, the strain cost rapidly drops towards its 
lower bound. As our network was derived from the minimization of the strain cost 
function, it is not surprising that the cost drops faster than in the other two. 

Second metric quantifies the deviation of the learned subspace from the actual prin¬ 
cipal subspace. At each T, the deviation is ||fJFt — V T V||^, where V is a m x n 
matrix whose rows are the principal eigenvectors, V T V is the projection matrix to the 
principal subspace. Ft is defined the same way for APEX and Foldiak networks as ours 
and FJFt is the learned estimate of the projection matrix to the principal subspace. 
Such deviation rapidly falls for each network confirming that all three algorithms learn 
the principal subspace, Figure [2)3. Again, our algorithm extracts the principal subspace 
faster than the other two networks. 

Third metric measures the degree of non-orthonormality among the computed neu¬ 
ral filters. At each T: 'Fy F 7 — I m ||^. Non-orthonormality error quickly drops for 
all networks, confirming that neural filters converge to orthonormal vectors, Figure [2]C. 
Yet again, our network orthonormalizes neural filters much faster than the other two 
networks. 


5 Subspace tracking using a neural network with local 
learning rules 

We have demonstrated that our network learns a linear subspace of streaming data gen¬ 
erated by a stationary distribution. But what if the data are generated by an evolving 
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Figure 2: Performance of the asynchronous neural network compared with existing 
algorithms. Each algorithm was applied to 40 different random data sets drawn from 
the same Gaussian statistics, described in text. Weight initializations were random. 
Solid lines indicate means and shades indicate standard deviations across 40 runs. All 
errors are in decibells (dB). For formal metric definitions, see text. A. Strain error as a 
function of data presentations. Dotted line is the best error in batch setting, calculated 
using eigenvalues of the actual covariance matrix. B. Subspace error as a function of 
data presentations. C. Non-orthonormality error as a function of data presentations. 
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distribution and we need to track the corresponding linear subspace? Using the al¬ 


gorithm (12) would be suboptimal because the learning rate is adjusted to effectively 


‘remember” the contribution of all the past data points. 

A natural way to track an evolving subspace is to “forget” the contribution of older 


data points (Yang 19951). In this Section, we derive an algorithm with “forgetting” 


from a principled cost function where errors in the similarity of old data points are 
discounted: 


T T 


y T = arg min * * ( X * Tx *' - ^^') ‘ 


(26) 


Yt 


t =1 t '=1 


where f3 is a discounting factor 0 < /3 < 1 with (3 = 1 corresponding to our original 
algorithm (|5|). The effective time scale of “forgetting” is: 


r := —1/ In f3. 


(27) 


By introducing a T x T-dimensional diagonal matrix /3t with diagonal elements 


0T.u = 0 1 we can rewrite ( |26| ) in a matrix notation: 


y T = arg min ||/3jX T X/3 T - /^Y T Y,5 r ||^ 


(28) 


y t 


A similar discounting was used in (Yang [1995 ) to derive subspace tracking algorithms 
from the representation error cost function, ([3]). 


To derive an online algorithm to solve (28) we follow the same steps as before. By 


keeping only the terms that depend on current output yr we get: 


yr = arg mm 
yr 


tT -1 


fT -1 


. t=l 


-4x? ( J2^ T ~^ t yJ ) y T + 2y? ^ 

-2||x T || 2 ||y T || 2 + ||y T || 4 ] . (29) 
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In fl29[ ), provided that past input-input and input-output outer products are not forgotten 
for a sufficiently long time, i.e. r >> 1, the first two terms dominate over the last two 


for large T. After dropping the last two terms we arrive at: 

fT-l \ /X —1 


yx = arg mm 

Yt 


-4xJ £ /9 2<T_ ' l x 1 y, T yr + 2yJ £ yr 


. (30) 


t=l / \ i=l 

As in the non-discounted case, minimization of the discounted online CMDS cost 


function by coordinate descent (30) leads to a neural network with asynchronous up¬ 


dates, 


VT,i <- W T,ij X Tj - Y M T,ijVT,j j 
3 = 1 j =1 

and by a Jacobi iteration - to a neural network with synchronous updates, 


y T W t x t — M^y r , 


(31) 


(32) 


with synaptic weight matrices in both cases given by: 

X—i x-i 

E 

t =i 




E 


W T,ii= T _ x 


Mr 


E /3 2(T -% 2 , 


t=l 


X-l 

E /? 2(T - ,) y, 2 i 


Mr# — 0. (33) 


t=l 


Finally, we rewrite ( [33] ) in a recursive form. As before, we introduce a scalar vari¬ 
able D? i representing the discounted cumulative activity of a neuron i up to time T — 1, 


x-i 




(34) 


t=l 


Then, the recursive updates are: 


D T+l,i P 2D T,i + VT,i 

w? +lti j <- + y T , (x TJ - /l4 + i,i 

^T+l,i,j^i MT,ij + VT,i ( 'VT,j ~ /^T+l,r 


(35) 
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These updates are local and almost identical to the original updates ( p~2j ) except the 
D^ +li update, where the past cumulative activity is discounted by /3 2 . For suitably 
chosen /3, the learning rate, 1/D^ +1 •, stays sufficiently large even at large-T, allowing 
the algorithm to react to changes in data statistics. 

As before, we have a two-phase algorithm for minimizing the discounted online 


CMDS cost function fl30| ). For each data presentation, first the neural network dynamics 
is run using either (fTi~[) or ([32]) until the dynamics converges to a fixed point. In the 


second step, synaptic weights are updated using < [35[ ). 

In Figure [3j we present the results of a numerical simulation of our subspace track¬ 
ing algorithm with asynchronous updates similar to that in Section [4] but for non¬ 
stationary synthetic data. The data are drawn from two different Gaussian distribu¬ 
tions: from T = 1 to T = 2500 - with covariance Ci, and from T = 2501 to 
T = 5000 - with covariance C 2 . We ran our algorithm with 4 different f3 factors, 
P = 0.998, 0.995, 0.99,0.98 (r = 499.5,199.5,99.5,49.5). 

We evaluate the subspace tracking performance of the algorithm using a modifica¬ 
tion of the subspace error metric introduced in Section [4} From T = 1 to T = 2500 the 
error is 11F*- V^Vi || 2 p , where Vi is am x n matrix whose rows are the principal 
eigenvectors of Ci. From T = 2501 to T = 5000 the error is ||F^F t — V^V 2 ||^, 
where V 2 isamxn matrix whose rows are the principal eigenvectors of C 2 . Figure 
[3j\ plots this modified subspace error. Initially, the subspace error decreases, reach¬ 
ing lower values with higher (3. Higher 3 allows for smaller learning rates allowing a 
fine-tuning of the neural filters and hence lower error. At T = 2501, a sudden jump 
is observed corresponding to the change in principal subspace. The network rapidly 
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corrects its neural filters to project to the new principal subspace and the error falls to 
before jump values. It is interesting to note that higher /3 now leads to a slower decay 
due to extended memory in the past. 

We also quantify the degree of non-orthonormality of neural filters using the non¬ 
orthonormality error defined in Section |4} Initially, the non-orthonormality error de¬ 
creases, reaching lower values with higher (3. Again, higher [5 allows for smaller learn¬ 
ing rates allowing a fine-tuning of the neural filters. At T — 2501, an increase in 
orthonormality error is observed as the network is adjusting its neural filters. Then, 
the error falls to before change values, with higher /3 leading to a slower decay due to 
extended memory in the past. 

6 Discussion 

In this paper, we made a step towards a mathematically rigorous model of neuronal di¬ 
mensionality reduction satisfying more biological constraints than was previously pos¬ 
sible. By starting with the CMDS cost function (|4]), we derived a single-layer neural 
network of linear units using only local learning rules. Using a local stability analysis, 
we showed that our algorithm finds a set of orthonormal neural filters and projects the 
input data stream to its principal subspace. We showed that with a small modification 
in learning rate updates, the same algorithm performs subspace tracking. 

Our algorithm finds the principal subspace, but not necessarily the principal com¬ 
ponents themselves. This is not a weakness since both the representation error cost ([3]) 
and CMDS cost (|4]) are minimized by projections to principal subspace and finding the 
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Figure 3: Performance of the subspace tracking asynchronous neural network with 
nonstationary data. The algorithm with different /3 factors was applied to 40 different 
random data sets drawn from the same nonstationary statistics, described in text. Weight 
initializations were random. Solid lines indicate means and shades indicate standard 
deviations. All errors are in decibells (dB). For formal metric definitions, see text. A. 
Subspace error as a function of data presentations. B. Non-orthonormality error as a 
function of data presentations. 
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principal components is not necessary. 


Our network is most similar to Foldiak’s network (Foldiak 1989), which learns 


feedforward weights by a Hebbian Oja rule and the all-to-all lateral weights by an anti- 
Hebbian rule. Yet, the functional form of the anti-Hebbian learning rule in Foldiak’s 
network, A My oc UiUj, is different from ours ( |T2j ) resulting in the following interest¬ 
ing differences: 1) Because the synaptic weight update rules in Foldiak’s network are 
symmetric, if the weights are initialized symmetric, i.e. M^ = M Jt , and learning rates 
are identical for lateral weights, they will stay symmetric. As mentioned above, such 
symmetry does not exist in our network ((|T2j) and (|20|)). 2) While in Foldiak’s network 


neural filters need not be orthonormal (Foldiak 1989, Leen, 1991), in our network they 


will be (Theorem [I]). 3) In Foldiak’s network output units are decorrelated (Foldiak 


1989), since in its stationary state (r/ii/j) = 0. This need not be true in our network. Yet, 
correlations among output units do not necessarily mean that information in the output 
about the input is reducec0 


Our network is similar to the APEX network (Kung and Diamantaras 1990) in the 


As pointed before (Linsker 


1988 


Plumbley 


1993 


1995 


Kung 2014), PCA max 


imizes mutual information between a Gaussian input, x, and an output, y = Fx, such 
that rows of F have unit norms. When rows of F are principal eigenvectors, outputs are 
principal components and are uncorrelated. However, the output can be multiplied by a 
rotation matrix, Q, and mutual information is unchanged, y' = Qy = QFx. y' is now 
a correlated Gaussian and QF still has rows with unit norms. Therefore, one can have 
correlated outputs with maximal mutual information between input and output, as long 


as rows of F span the principal subspace. 
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functional form of both the feedforward and the lateral weights. However the network 
architecture is different because the APEX network has a lower-triangular lateral con¬ 
nectivity matrix. Such difference in architecture leads to two interesting differences in 
the APEX network operation (Diamantaras and KungJ[l996): 1) The outputs converge 
to the principal components. 2) Lateral weights decay to zero and neural filters are the 
feedforward weights. In our network lateral weights do not have to decay to zero and 


neural filters depend on both the feedforward and lateral weights (18) 


In numerical simulations, we observed that our network is faster than Foldiak’s 
and APEX networks in minimizing the strain error, finding the principal subspace and 
orthonormalizing neural filters. This result demonstrates the advantage of our principled 
approach compared to heuristic learning rules. 

Our choice of coordinate descent to minimize the cost function in the activity dy¬ 
namics phase allowed us to circumvent problems associated with matrix inversion: 
y (I m + M) 'Wx. Matrix inversion causes problems for neural network implemen¬ 
tations because it is a non-local operation. In the absence of a cost function, Foldiak 
suggested to implement matrix inversion by iterating y Wx — My until conver¬ 


gence (Foldiak, 1989). We derived a similar algorithm using Jacobi iteration. However, 


in general, such iterative schemes are not guaranteed to converge (Hornik and Kuan 


1992). Our coordinate descent algorithm is almost always guaranteed to converge be¬ 
cause the cost function in the activity dynamics phase ([7]) meets the criteria in (Luo and 


Tseng, 1991 [ ). 

Unfortunately, our treatment still suffers from the problem common to most other 


biologically plausible neural networks (Hornik and Kuan, 1992): a complete global 
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convergence analysis of synaptic weights is not yet available. Our stability analysis 
is local in the sense that it starts by assuming that the synaptic weight dynamics has 
reached a stationary state and then proves that perturbations around the stationary state 
are stable. We have not made a theoretical statement on whether this state can ever 
be reached or how fast such a state can be reached. Global convergence results us¬ 


ing stochastic approximation theory are available for the single-neuron Oja rule (Oja 


and Karhunen 1985), its nonlocal generalizations (Plumbley 1995) and the APEX rule 
(Diamantaras and Kung[|T996l, however applicability of stochastic approximation the¬ 


ory was questioned recently (Zufiria 2002). Even though a neural network implemen¬ 


tation is unknown, Warmuth & Kuzmin’s online PCA algorithm stands out as the only 
algorithm for which a regret bound has been proved (Warmuth and Kuzmin, 2008). An 
asymptotic dependence of regret on time can also be interpreted as convergence speed. 

This paper also contributes to MDS literature by applying CMDS method to stream¬ 
ing data. However, our method has limitations in that to derive neural algorithms we 
used the strain cost (|4]) of CMDS. Such cost is formulated in terms of similarities, inner 
products to be exact, between pairs of data vectors and allowed us to consider a stream¬ 
ing setting where a data vector is revealed at a time. In the most general formulation of 
MDS pairwise dissimilarities between data instances are given rather than data vectors 


themselves or similarities between them (Cox and Cox 2000, Mardia et al. 1980). This 


generates two immediate problems for a generalization of our approach: 1) A mapping 
to the strain cost function ([4]) is only possible if the dissimilarites are Euclidean dis¬ 
tances (footnote [3]). In general, dissimilarities do not need be Euclidean or even metric 


distances (Cox and Cox 2000, Mardia et al. 1980) and one cannot start from the strain 
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cost ([4]) for derivation of a neural algorithm. 2) In the streaming version of the general 
MDS setting, at each step, dissimilarities between the current and all past data instances 


are revealed, unlike our approach where the data vector itself is revealed. It is a chal¬ 


lenging problem for future studies to find neural implementations in such generalized 
setting. 

The online CMDS cost functions {7]) and ( |30| ) should be valuable for subspace learn¬ 
ing and tracking applications where biological plausibility is not a necessity. Minimiza¬ 
tion of such cost functions could be performed much more efficiently in the absence of 
constraints imposed by biology^] It remains to be seen how the algorithms presented in 
this paper and their generalizations compare to state-of-the-art online subspace tracking 


algorithms from machine learning literature (Cichocki and Amari, 2002). 

Finally, we believe that formulating the cost function in terms of similarities sup¬ 
ports the possibility of representation invariant computations in neural networks. 
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A Appendix 


A.l Alternative derivation of an asynchronous network 


Here, we solve the system of equations ( fl4] ) iteratively (Strang 2009). First, we split 
the output covariance matrix that appears on the left-hand side of < [T4] ) into its diagonal 
component D 7 , a strictly upper triangular matrix Ur and a strictly lower triangular 
matrix L r : 


T—l 


y t yj — D r + Ur + Lr 

t =1 

Substituting this into ( [T4] ) we get: 

(Dr + ujL t ) yr = ((1 — oj) D t — wU T ) yr + oj ( ^ y t x7 ) x T , 

where oj is a parameter. We solve ( fl4] ) by iterating 

((1 - w) D T - uU T ) yr + oj [ y t *J ) x r 


(36) 


y t 


(Dr T ojL 


'T -1 


, t=\ 


<T -1 


(37) 


,t =i 


(38) 


T —1 


until convergence. If symmetric y,y/ is positive definite, the convergence is guar- 

t =i 


anteed for 0 < uj < 2 by the Ostrowski-Reich theorem (Reich 1949; Ostrowski, 1954). 
When uj — l the iteration ( |38] ) corresponds to the Gauss-Seidel method and when oj > 1 
- to the succesive overrelaxation method. The choice of oj for fastest convergence de¬ 
pends on the problem, and we will not explore this question here. However, values 


around 1.9 are generally recommended (Strang 2009). 


Because in (37) the matrix multiplying yr on the left is lower triangular and on the 


right is upper triangular, the iteration (38) can be performed component-by-component 
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(Strang, 2009): 


Uta 


(1 - x) y T ,i + x 


/T -1 

E E y t ,i x 

k \t =i 


t,k I X T,k 


T—l 


— cu- 


E 


/T—l 

E E yt,iVtj ) vtj 

jfr Vt=i_ 

T—l 

E 


(39) 


t=i t=i 

Note that y T , is replaced with its new value before moving to the next component. 

This algorithm can be implemented in a neural network 

n m 

VT,i (1 — w) UT,i + cu Wx,ijXT,j — IX (40) 

3 = 1 i =1 

where and Mr, as defined in ([9]), represent the synaptic weights of feedforward 
and lateral connections respectively. The case of uj < 1 can be implemented by a leaky 
integrator neuron. The u — 1 case corresponds to our original asynchronous algorithm, 
except that now updates are performed in a particular order. For the x > 1 case, which 
may converge faster, we do not see a biologically plausible implementation since it 
requires self-inhibition. 

Finally, to express the algorithm in a fully online form we rewrite ((9]) via recursive 


updates, resulting in (12). 
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A.2 Proof of Lemma [2] 


Proof of Lemma 2. In our derivation below, we use results from equations ( [18] ), ( [T9| ) 


and (20) of the main text. 


kl 


Y F ki (Vk^j) 

k 

(from ([18])) 

Y F ti {yl) W tj 

k 

(from ([T9])) 

Y Fki (Vk) ( M kp + S kp) F pj 

kp 

(from (|T8|)) 

Y Fh ( vl ) + tpk) F pj 

kp 

(from (|2Q])) 

E w » (yl) y 

p 

(from ([18])) 

y i ( ypXi ) Fpj 

p 

(from ([T9])) 

y ^ Fpk {■^k^i) Fpj y ^ F pk F p j (CF F^ „. 

(from ([18])) 


pk pk 


□ 


A.3 Proof of Lemma IH 


Here we calculate how ciF evolves under the learning rule, i.e. (A<5F) and derive equa¬ 


tion (25). 


First, we introduce some new notation to simplify our expressions. We define lateral 
synaptic weight matrix M with diagonals set to 1 as 


M := I m + M. 


(41) 
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We use ~ to denote perturbed matrices 


F := F + 6F. 

M := M + <5M, 


W := W + 5W, 

M := I + M = M + SM. (42) 


Note that when the network is run with these perturbed synaptic matrices, for input x, 
the network dynamics will settle to the fixed point 

y = M x Wx = Fx, (43) 

which is different from the fixed point of the stationary network, y = M _1 Wx = Fx. 
Now we can prove Lemma [4] 

Proof of Lemma [4] The proof goes in the following steps. 


1. Since our update rules are formulated in terms of W and M, it will be helpful to 


express c)F in terms of 5 W and <5M. The definition of F, equation ( [18] ), gives us the 
desired relation: 


(5M)F + M(<SF) = SW. 


(44) 


2. Next, we show that in the stationary state 


(A<SF) = M" 1 ((A5W) - (A5M) F) + O 



(45) 


Proof. Average changes due to synaptic updates on both sides of (44) are equal: 


A (<5M)F + M(5F) j = (A<5W). Noting that the unperturbed matrices are sta¬ 
tionary, i.e. (AM) = (AF) = (AW) = 0, one gets (A<SM) F + M (A<JF) = 
(A5W) + O ( T ~ 2 ), from which equation ([45]) follows. □ 
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3. Next step is to calculate (A5W) and (A<5M) using the learning rule, in terms of 


matrices W, M, C, F and <5F and plug the result into (45). This manipulation is 


going to give us the evolution of <5F equation, (25). 


First, (ASW) : 


(A SWu) = (A W i3 


T (yf) 

l 


u 


(yiXj) - (yf) Wi 

T / 2\ \ 22 Fik ( XkXj) - 22 F ik Fu ( X k Xl ) Wi 


ij 


(from ( |43T )) 


k kl 

'j 1 jy‘i\ | ^ ^ FifcCkj ^ FikFuC fc/VF-y 

rp , 2\ ( ^2 FikC'kj — 22 FikFuCklWij + 22 SFikCk 
Wi' \ k kl k 

-2 ^ SFnFaCuWij - 22 F^C^Wij ] (from ©) 


'kj 


kl 


kl 


T (yf) 


22 dFikCkj — 2 22 fiFikFuCkiWij 


kl 


22FikFaC kl 8W ij ) . 


(from ( p~9] )) 


kl 
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Next we calculate (A<5M) : 


(A SMij) = (a4 : 


T (yf) 

l 

TW) 


T (Vi) 

1 

TW) 


t ( y!) 


( yiVj) - < y 2 i ) Mijj - —5ij ($) (last term sets A M u = 0) 
y ^ F'ikFji ( x k xi ) y ^ FifcFn (jx k xi) Mij 

kl kl 

% ^ (aw) ) ( from <EH» 

kl / 

^ FikFjiCki — ^ FikFuCkiMij — Sij yy F ik FuCki 

kl kl kl 

Y, FikFjiCki — ^ F ik FuC k iMij — 8 % j yy F ik FuCk 

kl kl kl 

yy fiF ik FjiCki +yy F ik 8Fjic k i —2 yy sF^FuCkiM^ 

kl kl kl 

yy FikFuCkiSMij - 2 Sij yy 8F ik FuC kl ) (from ((42])) 

fci fcZ / 

yy 3F ik Fjic k i + yy F ik 8Fjic k i — 2 yy 8F ik Fnc k iMij 

kl kl kl 

yy F ik FiiC kl 5Mij - 28 ij yy 5F ik F u C k i ) . (from ([ 20 ])) 


'kl 


kl 


kl 


Plugging these in equation (45), we get 


<A 5 F y > = yy 


Ma 1 

T (Vi) 


yy - 2 yy 5F kl F kp c lp w kj - yy F k iF kp ci P 8Wi 


kj 


Ip 


Ip 


yy 8F k iF rp Ci p F r j yy F k i5F rp Ci p Frj 


Ipr 


Ipr 


2 yy 5F k iF kp Ci p M kr F r j + yy F k iF kp Ci p 8M kr F 7 

Ipr Ipr 


rj 


+2 E $kr$FkiFkpCipFt 

Ipr 


rj 




M kr and <5M fcr terms can be eliminated using the previously derived relations (18) 


and (]44|). This leads to a cancellation of some of the terms given above, and finally 
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we have 




Mg_ 

T{vl) 


E^«c«-E 3 Fki F rp C ip F r j 


Ipr 


^ ^ Fki5F rp Ci p F r j ^ ^ F kt F kp C[ p F£ k rSF 1 

Ipr Ipr 
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To proceed further, we note that: 


(yl) = (fcf t ) m , 


(46) 


which allows us to simplify the last term. Then, we get our final result: 


< A ^> = f £ 


1 — M- 1 


(vl) 
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□ 


A.4 Proof of Theorem [3| 


For ease of reference, we remind that in general <5F can be written as, 


(5F = 6A F + SS F + 6B G. 


(1241) 


Here, 6A is an m x m skew symmetric matrix, SS is an in x m symmetric matrix and 
c)B is an m x (n — rri) matrix. G is an (n — m) x n matrix with orthonormal rows. 
These rows are chosen to be orthogonal to the rows of F. Let v 1,, " ,n be the eigenvectors 
C and v 1 ’’" ,n be the corresponding eigenvalues. We label them such that F spans the 
same space as the space spanned by the first m eigenvectors. We choose rows of G to 
be the remaining eigenvectors, i.e. G T := [v m+1 ,..., v n ]. Then, for future reference, 

FG T = 0, GG t = l( n - m ), and ]T C ik G T kj = C lk v{ +m = F +m G] y (47) 

k k 
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We also remind the definition: 


M := I m + M. © 

Proof of Theorem^ Below, we discuss the conditions under which perturbations of F 
are stable. We work to linear order in T” 1 as stated in Theorem [3j We treat separately 
the evolution of 5 A, 5S and 4B under a general perturbation <5F . 

1. Stability of <5B 


1.1 Evolution of 5B is given by: 


(Aizy = i Y, Wf vl+m - s * 1 6B *i- 


(48) 


Proof Starting from ( [24] ) and using ( |47j ): 


(A SBij) = (A SFik) G 


^ V V 5F kl C lv G jv - ^SB ir 

T k V T 


Here the last line results from equation ( [47] ) applied to ([25]). Let’s look at the 
first term again using (|47|) and then (|24|), 




— v / J ■ '"<) i>i. 


Combining these give (48) 


□ 


1.2 When is (48) stable? Next, we show that stability requires 




. 771+1 .,77 
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For ease of manipulation, we express ( |48j ) as a matrix equation for each column 


of ()B. For convenience we change our notation to 5B kj = 5B 3 k 


(A 6B{) = '£ P * SB l 

k 

where P 3 k = 1= (O ik v J+m - 5 lk ) , and O ik = 

1 vJkl 

We have one matrix equation for each j. These equations are stable if all eigen¬ 
values of all PJ are negative. 

{eig(P)} < 0 =► {eig(O)} < —, j = m + l,...,n. 

v i 

=> (eig(0 -1 )} > vj, j = m + 1,..., n. 


1.3 If one could calculate eigenvalues of O ', the stability condition can be articu¬ 
lated. We start this calculation by noting that 

k k 

= J2 = Sij (from ©). (49) 

k 

Therefore, 

O -1 = (yy T > = fcf t . (50) 

Then, we need to calculate the eigenvalues of FCF T . They are: 

eiglO- 1 ) = {t>\ 

Proof. We start with the eigenvalue equation. 

FCFA = AA 
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Multiply both sides by F T : 


F t FCF t A = A (F t A) . 


Next, we use the commutation of F F and C, (23), and the orthogonality of 


neural filters. FF = I m , (22) to simplify the left hand side: 


F FCF A = CF FF A = C (F t A) . 


This implies that 


C (F t A) = A (F t A) 


(51) 


Note that by orthogonality of neural filters, the following is also true: 

F t F (F t A) = (F t A) . (52) 

All the relations above would hold true if A = 0 and (F t A) = 0, but this 
would require F (F t A) = A = 0, which is a contradiction. Then, ([37]) and 
( |52| ) imply that (F t A) is a shared eigenvector between C and F T F. F T F and 
C was shown to commute before and they share a complete set of eigenvectors. 
However, some n — m eigenvectors of C have zero eigenvalues in F T F. We 
had labeled shared eigenvectors with unit eigenvalue in F T F to be v 1 ,..., v m . 
The eigenvalue of (F t A) with respect to F T F is 1, therefore F T A is one of 
vi,..., v m . This proves that A = {V,..., v m } and 

eig(0 -1 ) = {v 1 ,..., . 


□ 
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1.4 From ( p0| ), it follows that for stability 


{u\...,u m } > {v m+1 ,... ,v n ) 


2. Stability of 8 A and 8 S 


Next, we check stabilities of 8 A and 8 S. 


(A 8Ay) + (A 8Sij) = ^ (A8F ik ) (from definition ( [24] )) 

k 

= E 7%- E F ^F, m C, m - i + SSy) 

= 4 £ ^ £ ( FCF A (K+ s%) - ■ 

(53) 


In deriving the last line, we used equations ( |24j ) and (|47J). The k summation was 
calculated before ( |49l ). Plugging this in ( |53j ), one gets 


(A SAij) + (A SSij) = ~ (SAj + 8A% + 8S i3 + 8S i3 ) = ~^8S l3 
=> (A 8Aij) = 0 (from skew symmetry of A) 

=> (A 5S i3 ) = Ap SS i3 • 


<5A perturbation, which rotates neural filters to other orthonormal basis within the 
principal subspace, does not decay. On the other hand, 3S destroys orthonormality 
and these perturbations do decay, making the orthonormal solution stable. 


Collectively, the results above prove Theorem [3} 


□ 
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A.5 Perturbation of the stationary state due to data presentation 

Our discussion of the linear stability of the stationary point assumed general perturba¬ 
tions. Perturbations that arise from data presentation, 

SF = AF. (54) 


form a restricted class of the most general case, and have special consequences. Focus¬ 
ing on this case, we show that data presentations do not rotate the basis for extracted 
subspace in the stationary state. 

We calculate perturbations within the extracted subspace. Using (|24[) and ([47]) 


5F F t 


aff t 

from (54) 

M’ 1 (aW-AMf) F t 

expand (fT8j) to first order in A 

M’ 1 (AWF T -AM) 

from (22). 


Let’s look at AW F T term more closely: 

(AW F T ) y = m ( Vi x k - tfW a ) F kj 

k 

( Vi F i kXk ~ y i WikF kj 

\ k k / 

= Vi (yiVk- yi^i. 


= Vi 


= A Mij. 


Plugging this back into (55) gives, 


5A + 5S = 0, 


<5A = 0, & 5S = 0, 


(56) 
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Therefore, perturbations that arise from data presentation do not rotate neural filter basis 

within the extracted subspace. This property should increase the stability of the neural 

filter basis within the extracted subspace. 
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